This notebook explores interesting trends, relationships, and other features in the data.
import cudf
import cupy as cp
import dask_cudf
import numpy as np
# visualisation
#map
import plotly.graph_objects as go
# datashader
import datashader as ds, pandas as pd, colorcet
#holoviews
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import datashade, shade, dynspread, spread, rasterize
from holoviews.operation import decimate
# statsmodels
import statsmodels.formula.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import os
hv.extension('plotly')
DATA_PATH = '../data'
RES_PATH = '../res'
Airline IATA Code Data
iata_codes = cudf.read_csv(os.path.join(DATA_PATH, 'iata_codes.csv'))
df = cudf.read_csv(os.path.join(RES_PATH, 'selected.csv'))
df.head()
| Year | Quarter | DayOfWeek | FlightDate | IATA_CODE_Reporting_Airline | Origin | OriginCityName | OriginState | Dest | DestState | ... | NASDelay | SecurityDelay | LateAircraftDelay | FirstDepTime | TotalAddGTime | LongestAddGTime | DivAirportLandings | DivReachedDest | DivActualElapsedTime | DivArrDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1998 | 1 | 5 | 1998-01-02 | NW | MSP | Minneapolis, MN | MN | SLC | UT | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 1 | 2009 | 2 | 4 | 2009-05-28 | FL | MKE | Milwaukee, WI | WI | MCO | FL | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 0.0 | <NA> | <NA> | <NA> |
| 2 | 2013 | 2 | 6 | 2013-06-29 | MQ | GJT | Grand Junction, CO | CO | DFW | TX | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 0.0 | <NA> | <NA> | <NA> |
| 3 | 2010 | 3 | 2 | 2010-08-31 | DL | LAX | Los Angeles, CA | CA | DTW | MI | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 0.0 | <NA> | <NA> | <NA> |
| 4 | 2006 | 1 | 7 | 2006-01-15 | US | EWR | Newark, NJ | NJ | CLT | NC | ... | 0.0 | 0.0 | 32.0 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
5 rows × 49 columns
This section visually examines relationships between individual attributes.
The bar chart shows the proportion of flights delayed as a function of flight length, looking for a possible relation between time spent in the air and arrival delay. From the histogram below, there appears to be an inverse relation (down to around 20 minutes) between time spent in the air and the share of flights delayed.
data = df[['AirTime', 'ArrDel15', 'Flights', 'ArrDelayMinutes']]
data['AirTime'] = abs(data['AirTime'])
data = data.loc[(data['AirTime'] > 0) & (data['Flights'] > 0)]
allflights = data[data['ArrDel15'] == 0]
delflights = data[data['ArrDel15'] == 1]
allcounts = allflights.groupby('AirTime').agg('sum').reset_index().drop(['ArrDel15', 'ArrDelayMinutes'],axis=1)
delcounts = delflights.groupby('AirTime').agg('sum').reset_index().drop(['ArrDel15'],axis=1)
allmerged = cudf.merge(allcounts, delcounts, on='AirTime').sort_values('AirTime', ignore_index=True, ascending=True)
allmerged.columns = ['air_time', 'all_flights','delayed_flights', 'arrival_delay']
allmerged['ratio'] = allmerged['delayed_flights']/(allmerged['all_flights'] + allmerged['delayed_flights'])
allmerged.head()
| air_time | all_flights | delayed_flights | arrival_delay | ratio | |
|---|---|---|---|---|---|
| 0 | 1.0 | 6.0 | 1.0 | 58.0 | 0.142857 |
| 1 | 2.0 | 3.0 | 1.0 | 67.0 | 0.250000 |
| 2 | 6.0 | 8.0 | 1.0 | 24.0 | 0.111111 |
| 3 | 7.0 | 15.0 | 2.0 | 228.0 | 0.117647 |
| 4 | 8.0 | 35.0 | 5.0 | 315.0 | 0.125000 |
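As a sanity check, the ratio logic above can be reproduced in plain Python with a few toy counts (illustrative numbers only, chosen to match the shape of the first rows above):

```python
# Toy per-air-time counts: air_time -> (on-time flights, delayed flights).
counts = {
    1: (6, 1),
    2: (3, 1),
    6: (8, 1),
}

def delay_ratio(on_time, delayed):
    # Same formula as the 'ratio' column: delayed / (on-time + delayed)
    return delayed / (on_time + delayed)

ratios = {t: delay_ratio(on_time, delayed) for t, (on_time, delayed) in counts.items()}
```

For air time 1 this gives 1/7 ≈ 0.142857, agreeing with the first row of the table.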
hv.extension('plotly')
bar = hv.Bars(allmerged, kdims=['air_time'], vdims=['ratio']).opts(color='red', width=900, height=400, ylabel='Ratio of Flights Delayed', xlabel='Length of Flight (mins)', title='Ratio of Flights Delayed to Length of Flight', ylim=(0, 1), xticks=list(range(0, 700, 100)))
bar
Here we look for a relation between a flight arriving late and it having departed late. The intuition is that lost time cannot easily be 'taken back' in the air (although pilots can sometimes route with the wind to arrive more quickly). The graph shows a fairly strong linear correlation between arrival delay and departure delay (in minutes). The lower range shows a significant increase in arrival delay even for small departure delays; this may be investigated further.
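The linear correlation claimed here can be quantified with a Pearson coefficient; a minimal pure-Python sketch on hypothetical delay pairs (not drawn from the dataset):

```python
def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of std deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical (departure delay, arrival delay) pairs in minutes
dep = [5, 10, 20, 40, 60]
arr = [8, 12, 18, 45, 58]
r = pearson(dep, arr)  # close to 1 for a near-linear relationship
```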
data = df[['DepDelay', 'ArrDelay']]
data = data[(data['DepDelay'] > 0) & (data['ArrDelay'] > 0) ]
data = hv.Dataset(data)
points = hv.Scatter(data, kdims=['DepDelay'], vdims=['ArrDelay'])
decimate(points).opts(cmap=colorcet.kbc, width=800, height=800, xlim=(-10, 400),ylim=(-10, 400), title='Arrival Delay vs Departure Delay')
This section looks for a correlation between whether a flight is delayed and its actual time of arrival. The theory is that certain flights may be delayed more than usual at different times of day due to differences in traffic, etc. However, visual inspection of the graph below shows no significant linear relation between arrival delay and arrival time. Additionally, the variance of arrival delay does not appear constant across arrival times.
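The mean-delay-per-arrival-time curve below boils down to a groupby-mean; the same aggregation sketched in plain Python on made-up records:

```python
# Hypothetical (ArrTime, ArrDelay) records
records = [(900, 5), (900, 15), (1700, 30), (1700, 10), (1700, 20)]

sums, counts = {}, {}
for arr_time, delay in records:
    sums[arr_time] = sums.get(arr_time, 0) + delay
    counts[arr_time] = counts.get(arr_time, 0) + 1

# Mean arrival delay per arrival time, like groupby('ArrTime').agg('mean')
mean_delay = {t: sums[t] / counts[t] for t in sums}
```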
data = df[['ArrTime', 'ArrDelay']]
mean_data = data.groupby('ArrTime').agg('mean').reset_index()
mean_data = mean_data.sort_values('ArrTime')
mean_data = hv.Dataset(mean_data)
curve = hv.Curve(mean_data, vdims=['ArrDelay'], kdims=['ArrTime'])
data = hv.Dataset(data)
points = hv.Scatter(data, kdims=['ArrTime'], vdims=['ArrDelay'])
hv.Layout([decimate(points).opts(width=800, height=500, title='Arrival Delay vs Arrival Time'), decimate(curve).opts(width=800, height=500, title='Mean Arrival Delay vs Arrival Time')]).cols(1)
hv.extension('bokeh')
import numpy as np
data = df[['FlightDate', 'ArrDel15', 'DepDel15']].sort_values('FlightDate', ignore_index=True)
data['ArrDel15'] = data['ArrDel15'].astype('uint8')
data['DepDel15'] = data['DepDel15'].astype('uint8')
data['FlightDate'] = cudf.to_datetime(data['FlightDate'])
grouped = data.groupby('FlightDate').count().reset_index().sort_values('FlightDate', ascending=False, ignore_index=True)
grouped = grouped[(grouped.ArrDel15 > 0) & (grouped.DepDel15 > 0)]
# grouped = hv.Dataset(grouped)
colors = ['#636EFA', '#EF553B']
delay = hv.Curve((grouped.FlightDate.to_array(), grouped.DepDel15.to_array()), vdims=['DepDel15'], kdims=['FlightDate']).opts(width=1000, height=500, color=colors[0])
arrival = hv.Curve((grouped.FlightDate.to_array(), grouped.ArrDel15.to_array()), vdims=['ArrDel15'], kdims=['FlightDate']).opts(width=1000, height=500, color=colors[1])
hv.Layout([delay, arrival]).cols(1)
Aiming to visualise how strongly (or weakly) CRS time correlates with actual arrival and departure times. This gives a sense of how much variance actual arrival and departure times exhibit with respect to scheduled (CRS) times. Red-coloured points are flights delayed by more than 15 minutes.
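One caveat when reading these plots: ArrTime, DepTime and their CRS counterparts are HHMM-encoded integers (e.g. 1836 means 18:36), so treating them as plain numbers leaves gaps between x59 and (x+1)00. A small helper to convert them to minutes after midnight, should a linear time axis be wanted:

```python
def hhmm_to_minutes(hhmm):
    # Split an HHMM integer into hours and minutes, e.g. 1836 -> (18, 36)
    hours, minutes = divmod(int(hhmm), 100)
    return hours * 60 + minutes
```

hhmm_to_minutes(1836) is 1116 minutes after midnight; the scatter plots below keep the raw HHMM encoding.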
from tqdm import tqdm
hv.extension('plotly')
data = df[['ArrTime', 'CRSArrTime', 'DepTime', 'CRSDepTime', 'ArrDel15', 'DepDel15']]
cmap = {1: '#EF553B', 0: '#636EFA'}
data_ontime = data[data['ArrDel15'] == 0]
data_delayed = data[data['ArrDel15'] == 1]
points_ontime = hv.Scatter(data_ontime, kdims=['CRSArrTime'], vdims=['ArrTime'], label='Flights Not Delayed').opts(color=cmap[0])
points_delayed = hv.Scatter(data_delayed, kdims=['CRSArrTime'], vdims=['ArrTime'], label='Flights Delayed').opts(color=cmap[1])
decimate(points_ontime * points_delayed).opts(
    width=800,
    height=800,
    xlim=(-5, 2400),
    ylim=(-5, 2400),
    title='Actual Arrival Time vs Scheduled Arrival Time')
points = hv.Scatter(data, kdims=['CRSDepTime'], vdims=['DepTime', 'DepDel15']).opts(color=hv.dim('DepDel15').categorize(cmap))
decimate(points).opts(
width=800,
height=800,
xlim=(-5, 2400),
ylim=(-5, 2400),
title='Actual Departure Time vs Scheduled Departure Time')
points = hv.Scatter(df, kdims=['CRSElapsedTime'], vdims=['ActualElapsedTime', 'ArrDel15']).opts(color=hv.dim('ArrDel15').categorize(cmap))
decimate(points).opts(
width=800,
height=800,
xlim=(0, 500),
ylim=(0, 500),
title='Actual Elapsed Time vs Estimated Elapsed Time')
First, we look for a relationship between airline and arrival delay by charting delayed flights as a percentage of all flights flown by each airline. This controls for flight volume: an airline that flies more has more chances for a flight to be delayed, which would otherwise give larger or busier airlines absolutely larger delay counts.
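The per-airline percentage can be checked in miniature with plain Python (hypothetical airline codes and delay flags, not dataset values):

```python
# Hypothetical (airline, delayed) records: 1 = arrived 15+ minutes late
flights = [
    ('AA', 0), ('AA', 1), ('AA', 0), ('AA', 0),
    ('B6', 1), ('B6', 1), ('B6', 0),
]

totals, delayed = {}, {}
for airline, late in flights:
    totals[airline] = totals.get(airline, 0) + 1
    delayed[airline] = delayed.get(airline, 0) + late

# Percent of each airline's flights that arrived delayed
pct = {a: 100 * delayed[a] / totals[a] for a in totals}
```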
data = df[['IATA_CODE_Reporting_Airline', 'ArrDel15', 'Flights']]
alldata = df[['IATA_CODE_Reporting_Airline', 'Flights']]
alldata = alldata.groupby('IATA_CODE_Reporting_Airline').count().reset_index().sort_values('Flights', ignore_index=True)
alldata.columns = ['Airline', 'Flights']
data = data[data.ArrDel15 == 1]
data = data.drop('ArrDel15', axis=1)
data = data.groupby('IATA_CODE_Reporting_Airline').count().reset_index().sort_values('Flights', ignore_index=True)
data.columns = ['Airline', 'Delayed']
ratio_data = cudf.merge(alldata,data, on='Airline')
ratio_data['Percent Delayed'] = (ratio_data['Delayed']/ratio_data['Flights'])*100
ratio_data = ratio_data.drop(['Flights', 'Delayed'], axis=1)
ratio_data = ratio_data.sort_values(['Percent Delayed'], ignore_index=True)
allbars = hv.Bars(ratio_data, kdims=['Airline'], vdims=['Percent Delayed']).redim.values(Airline=ratio_data.Airline.to_array()).opts(width=1200, color='#580aff', title='Percentage Flights Delayed (All Airlines)')
allbars
Zeroing in from the above, Piedmont Aviation, Ukraine International Airways and JetBlue Airways appear to have the highest delay rates.
top_airlines = ratio_data.iloc[-8:, :].reset_index(drop=True)
iata_codes.columns = ['Airline', 'Name']
top_airlines = cudf.merge(top_airlines, iata_codes, on='Airline')
top_airlines.drop('Airline', axis=1, inplace=True)
top_airlines.columns = ['Delayed', 'Airline']
top_airlines = top_airlines.sort_values('Delayed', ignore_index=True)
hv.Bars(top_airlines, kdims=['Airline'], vdims=['Delayed']).redim.values(Airline=top_airlines.Airline.to_array()).opts(color='#d30aff', width=1400, title='Percentage Flights Delayed by Airline (top eight)', ylabel='Percent Delayed')
Exploring whether certain days are more prone to flight delays: from the second bar chart, it seems the chance of a flight being delayed in the middle of the week is higher.
hv.extension('bokeh')
hv.renderer('bokeh').theme = 'dark_minimal'
ontime = df[df['ArrDel15'] == 0]
delayed = df[df['ArrDel15'] == 1]
ontime_cnt = ontime[['ArrDel15', 'DayOfWeek']].groupby('DayOfWeek').agg('count').sort_index().reset_index()
delayed_cnt = delayed[['ArrDel15', 'DayOfWeek']].groupby('DayOfWeek').agg('sum').sort_index().reset_index()
ontime_cnt['Status'] = 'On Time'
delayed_cnt['Status'] = 'Delayed'
stats = cudf.concat([ontime_cnt, delayed_cnt], axis=0)
stats.columns = ['DayOfWeek', 'Flights', 'Status']
stats['Day'] = 'DAY'
# BTS coding: DayOfWeek runs 1 = Monday through 7 = Sunday
stats.loc[stats['DayOfWeek'] == 1, ['Day']] = 'Monday'
stats.loc[stats['DayOfWeek'] == 2, ['Day']] = 'Tuesday'
stats.loc[stats['DayOfWeek'] == 3, ['Day']] = 'Wednesday'
stats.loc[stats['DayOfWeek'] == 4, ['Day']] = 'Thursday'
stats.loc[stats['DayOfWeek'] == 5, ['Day']] = 'Friday'
stats.loc[stats['DayOfWeek'] == 6, ['Day']] = 'Saturday'
stats.loc[stats['DayOfWeek'] == 7, ['Day']] = 'Sunday'
stats = stats.sort_values('DayOfWeek', ignore_index=True)
hv.Bars(stats, kdims=['DayOfWeek', 'Status'], vdims=['Flights'])\
.opts(title='Number of Delayed Arrivals by Weekday',
width=1200,
height=400).sort('DayOfWeek')
from datetime import datetime
data = df
data['FlightDate'] = cudf.to_datetime(data['FlightDate'])
data['Week'] = cudf.Series(data['FlightDate'].to_pandas().dt.strftime('%W'))  # week-of-year; strftime via pandas
arrival_data = data[data['ArrDel15'] == 1]
arrival_data = arrival_data[['Flights', 'Week']]
hv.extension('bokeh')
hv.renderer('bokeh').theme = 'dark_minimal'
arr_grouped = arrival_data.groupby('Week')['Flights'].sum().reset_index().sort_values('Week', ignore_index=True)
arr_grouped['Flights'] /= (2020-1987)
dep_data = data[data['DepDel15'] == 1]
dep_data = dep_data[['Flights', 'Week']]
dep_grouped = dep_data.groupby('Week')['Flights'].sum().reset_index().sort_values('Week', ignore_index=True)
dep_grouped['Flights'] /= (2020-1987)
list_of_curves = [
hv.Curve(dep_grouped, label='Departures').opts(show_grid=True, color='#580aff', xlim=(0,52)),
hv.Curve(arr_grouped, label='Arrivals').opts(show_grid=True, color='#d30aff', xlim=(0,52)),
]
hv.Overlay(list_of_curves).opts(
height=600,
width=1000,
xlabel='Week',
ylabel='Average Delayed',
title='Average Number of Flights Delayed',
legend_position='right'
)
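The weekly curves above divide each week's summed delay counts by the number of years covered; the normalisation in isolation, with toy totals:

```python
YEARS = 2020 - 1987  # 33 years spanned by the dataset

# Hypothetical summed delayed-flight counts per week-of-year
weekly_totals = {'01': 660, '02': 330}
weekly_avg = {week: total / YEARS for week, total in weekly_totals.items()}
```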
import bokeh
from TOKEN import token
from bokeh.sampledata.airport_routes import airports, routes
airports = airports[['IATA', 'Latitude', 'Longitude']]
data = df[df['ArrDelayMinutes'] > 0]
data = data[['Year','Dest', 'Flights']]
data.columns = ['Year','IATA', 'Flights']
data = data.groupby(['Year', 'IATA']).agg('sum').reset_index().sort_values('Year', ignore_index=True)
airports = cudf.DataFrame(airports)
merged = cudf.merge(data, airports, how='left', on='IATA')
merged = merged.sort_values('Year', ignore_index=True)
merged = merged.dropna()
import plotly.express as px  # px.scatter_mapbox is used below
merged['Flights'] = merged['Flights']*100000  # scale counts up so marker sizes are visible
fig = px.scatter_mapbox(
# locationmode = 'USA-states',
lon = merged['Longitude'].to_array(),
lat = merged['Latitude'].to_array(),
hover_name=merged['IATA'].to_array(),
animation_frame=merged['Year'].to_array(),
size=(merged['Flights'].to_array()),
# scope='usa',
)
fig.update_layout(
mapbox_style="dark", mapbox_accesstoken=token,
autosize=False,
width=1600,
height=800,
showlegend = True
)
fig.show()
data = df.loc[:, [
'ActualElapsedTime',
'AirTime',
'ArrDel15',
'CRSArrTime',
'CRSDepTime',
'CRSElapsedTime',
'DayOfWeek',
'DepDel15',
'DepDelayMinutes',
'DepTime',
'Dest',
'DestState',
'DistanceGroup',
'Flights',
'Origin',
'OriginState',
'Quarter',
'TaxiOut',
'WheelsOff',
'WheelsOn'
]]
data.head()
| ActualElapsedTime | AirTime | ArrDel15 | CRSArrTime | CRSDepTime | CRSElapsedTime | DayOfWeek | DepDel15 | DepDelayMinutes | DepTime | Dest | DestState | DistanceGroup | Flights | Origin | OriginState | Quarter | TaxiOut | WheelsOff | WheelsOn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180.0 | 153.0 | 1.0 | 1836 | 1640 | 176.0 | 5 | 1.0 | 19.0 | 1659.0 | SLC | UT | 4 | 1.0 | MSP | MN | 1 | 24.0 | 1723.0 | 1856.0 |
| 1 | 159.0 | 141.0 | 0.0 | 1541 | 1204 | 157.0 | 4 | 0.0 | 0.0 | 1202.0 | MCO | FL | 5 | 1.0 | MKE | WI | 2 | 10.0 | 1212.0 | 1533.0 |
| 2 | 118.0 | 103.0 | 0.0 | 1945 | 1630 | 135.0 | 6 | 0.0 | 14.0 | 1644.0 | DFW | TX | 4 | 1.0 | GJT | CO | 2 | 9.0 | 1653.0 | 1936.0 |
| 3 | 250.0 | 220.0 | 0.0 | 2035 | 1305 | 270.0 | 2 | 0.0 | 0.0 | 1305.0 | DTW | MI | 8 | 1.0 | LAX | CA | 3 | 23.0 | 1328.0 | 2008.0 |
| 4 | 107.0 | 80.0 | 1.0 | 2026 | 1820 | 126.0 | 7 | 1.0 | 51.0 | 1911.0 | CLT | NC | 3 | 1.0 | EWR | NJ | 1 | 19.0 | 1930.0 | 2050.0 |
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(data.to_pandas())
tfdv.visualize_statistics(stats)
data = data[
data.DestState.notna() &
data.ActualElapsedTime.notna() &
data.AirTime.notna() &
data.ArrDel15.notna() &
data.TaxiOut.notna() &
data.WheelsOff.notna() &
data.WheelsOn.notna()]
data = data.loc[(data.ActualElapsedTime > 0) & (data.AirTime > 0)]
data['DepDel15'] = data['DepDel15'].fillna(0.0)  # fill only the missing flags, not entire rows
data.head()
| ActualElapsedTime | AirTime | ArrDel15 | CRSArrTime | CRSDepTime | CRSElapsedTime | DayOfWeek | DepDel15 | DepDelayMinutes | DepTime | Dest | DestState | DistanceGroup | Flights | Origin | OriginState | Quarter | TaxiOut | WheelsOff | WheelsOn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180.0 | 153.0 | 1.0 | 1836.0 | 1640.0 | 176.0 | 5.0 | 1.0 | 19.0 | 1659.0 | SLC | UT | 4.0 | 1.0 | MSP | MN | 1.0 | 24.0 | 1723.0 | 1856.0 |
| 1 | 159.0 | 141.0 | 0.0 | 1541.0 | 1204.0 | 157.0 | 4.0 | 0.0 | 0.0 | 1202.0 | MCO | FL | 5.0 | 1.0 | MKE | WI | 2.0 | 10.0 | 1212.0 | 1533.0 |
| 2 | 118.0 | 103.0 | 0.0 | 1945.0 | 1630.0 | 135.0 | 6.0 | 0.0 | 14.0 | 1644.0 | DFW | TX | 4.0 | 1.0 | GJT | CO | 2.0 | 9.0 | 1653.0 | 1936.0 |
| 3 | 250.0 | 220.0 | 0.0 | 2035.0 | 1305.0 | 270.0 | 2.0 | 0.0 | 0.0 | 1305.0 | DTW | MI | 8.0 | 1.0 | LAX | CA | 3.0 | 23.0 | 1328.0 | 2008.0 |
| 4 | 107.0 | 80.0 | 1.0 | 2026.0 | 1820.0 | 126.0 | 7.0 | 1.0 | 51.0 | 1911.0 | CLT | NC | 3.0 | 1.0 | EWR | NJ | 1.0 | 19.0 | 1930.0 | 2050.0 |
# Cast the discrete-valued columns to categorical dtype on the GPU,
# rather than round-tripping each column through pandas
for col in ['ArrDel15', 'DepDel15', 'DayOfWeek', 'Dest', 'DestState',
            'DistanceGroup', 'Origin', 'OriginState', 'Quarter']:
    data[col] = data[col].astype('category')
columns = ['ActualElapsedTime', 'AirTime', 'CRSArrTime', 'CRSDepTime',
'CRSElapsedTime', 'DayOfWeek', 'DepDel15', 'DepDelayMinutes', 'DepTime',
'DistanceGroup', 'Flights', 'Quarter', 'TaxiOut', 'WheelsOff', 'WheelsOn']
data[columns] = data[columns].astype('int32')
data.head()
| ActualElapsedTime | AirTime | ArrDel15 | CRSArrTime | CRSDepTime | CRSElapsedTime | DayOfWeek | DepDel15 | DepDelayMinutes | DepTime | Dest | DestState | DistanceGroup | Flights | Origin | OriginState | Quarter | TaxiOut | WheelsOff | WheelsOn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180 | 153 | 1.0 | 1836 | 1640 | 176 | 5 | 1 | 19 | 1659 | SLC | UT | 4 | 1 | MSP | MN | 1 | 24 | 1723 | 1856 |
| 1 | 159 | 141 | 0.0 | 1541 | 1204 | 157 | 4 | 0 | 0 | 1202 | MCO | FL | 5 | 1 | MKE | WI | 2 | 10 | 1212 | 1533 |
| 2 | 118 | 103 | 0.0 | 1945 | 1630 | 135 | 6 | 0 | 14 | 1644 | DFW | TX | 4 | 1 | GJT | CO | 2 | 9 | 1653 | 1936 |
| 3 | 250 | 220 | 0.0 | 2035 | 1305 | 270 | 2 | 0 | 0 | 1305 | DTW | MI | 8 | 1 | LAX | CA | 3 | 23 | 1328 | 2008 |
| 4 | 107 | 80 | 1.0 | 2026 | 1820 | 126 | 7 | 1 | 51 | 1911 | CLT | NC | 3 | 1 | EWR | NJ | 1 | 19 | 1930 | 2050 |
Columns still to be encoded: Dest, DestState, Origin and OriginState
dest_cat = dict(enumerate(data.to_pandas()['Dest'].cat.categories))
deststate_cat = dict(enumerate(data.to_pandas()['DestState'].cat.categories))
origin_cat = dict(enumerate(data.to_pandas()['Origin'].cat.categories))
originstate_cat = dict(enumerate(data.to_pandas()['OriginState'].cat.categories))
data['Dest_enc'] = data['Dest'].cat.codes
data['DestState_enc'] = data['DestState'].cat.codes
data['Origin_enc'] = data['Origin'].cat.codes
data['OriginState_enc'] = data['OriginState'].cat.codes
data.head()
| ActualElapsedTime | AirTime | ArrDel15 | CRSArrTime | CRSDepTime | CRSElapsedTime | DayOfWeek | DepDel15 | DepDelayMinutes | DepTime | ... | Origin | OriginState | Quarter | TaxiOut | WheelsOff | WheelsOn | Dest_enc | DestState_enc | Origin_enc | OriginState_enc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180 | 153 | 1.0 | 1836 | 1640 | 176 | 5 | 1 | 19 | 1659 | ... | MSP | MN | 1 | 24 | 1723 | 1856 | 350 | 46 | 258 | 23 |
| 1 | 159 | 141 | 0.0 | 1541 | 1204 | 157 | 4 | 0 | 0 | 1202 | ... | MKE | WI | 2 | 10 | 1212 | 1533 | 233 | 9 | 244 | 51 |
| 2 | 118 | 103 | 0.0 | 1945 | 1630 | 135 | 6 | 0 | 14 | 1644 | ... | GJT | CO | 2 | 9 | 1653 | 1936 | 101 | 45 | 144 | 6 |
| 3 | 250 | 220 | 0.0 | 2035 | 1305 | 270 | 2 | 0 | 0 | 1305 | ... | LAX | CA | 3 | 23 | 1328 | 2008 | 109 | 22 | 207 | 5 |
| 4 | 107 | 80 | 1.0 | 2026 | 1820 | 126 | 7 | 1 | 51 | 1911 | ... | EWR | NJ | 1 | 19 | 1930 | 2050 | 79 | 27 | 124 | 31 |
5 rows × 24 columns
data = data.drop(['Dest', 'DestState', 'Origin', 'OriginState'], axis=1)
data['ArrDel15'] = data['ArrDel15'].astype('uint8')
data.head()
| ActualElapsedTime | AirTime | ArrDel15 | CRSArrTime | CRSDepTime | CRSElapsedTime | DayOfWeek | DepDel15 | DepDelayMinutes | DepTime | DistanceGroup | Flights | Quarter | TaxiOut | WheelsOff | WheelsOn | Dest_enc | DestState_enc | Origin_enc | OriginState_enc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180 | 153 | 1 | 1836 | 1640 | 176 | 5 | 1 | 19 | 1659 | 4 | 1 | 1 | 24 | 1723 | 1856 | 350 | 46 | 258 | 23 |
| 1 | 159 | 141 | 0 | 1541 | 1204 | 157 | 4 | 0 | 0 | 1202 | 5 | 1 | 2 | 10 | 1212 | 1533 | 233 | 9 | 244 | 51 |
| 2 | 118 | 103 | 0 | 1945 | 1630 | 135 | 6 | 0 | 14 | 1644 | 4 | 1 | 2 | 9 | 1653 | 1936 | 101 | 45 | 144 | 6 |
| 3 | 250 | 220 | 0 | 2035 | 1305 | 270 | 2 | 0 | 0 | 1305 | 8 | 1 | 3 | 23 | 1328 | 2008 | 109 | 22 | 207 | 5 |
| 4 | 107 | 80 | 1 | 2026 | 1820 | 126 | 7 | 1 | 51 | 1911 | 3 | 1 | 1 | 19 | 1930 | 2050 | 79 | 27 | 124 | 31 |
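The label-to-code encoding used above can be sketched in plain Python; this mirrors what .cat.codes produces for lexically sorted categories (toy state labels, not the full dataset):

```python
def category_codes(values):
    # Sorted unique values become the categories; each value maps to its index
    categories = sorted(set(values))
    code = {c: i for i, c in enumerate(categories)}
    return categories, [code[v] for v in values]

states = ['UT', 'FL', 'TX', 'FL', 'MI']
cats, codes = category_codes(states)  # cats ['FL', 'MI', 'TX', 'UT'], codes [3, 0, 2, 0, 1]
```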
model = sm.ols("""ArrDel15 ~ ActualElapsedTime + AirTime + CRSArrTime + CRSDepTime +
    CRSElapsedTime + DayOfWeek + DepDel15 + DepDelayMinutes + DepTime + DistanceGroup +
    Flights + Quarter + TaxiOut + WheelsOff + WheelsOn + Dest_enc + DestState_enc +
    Origin_enc + OriginState_enc""", data=data.to_pandas()).fit()
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: ArrDel15 R-squared: 0.656
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 1.589e+05
Date: Sun, 27 Jun 2021 Prob (F-statistic): 0.00
Time: 20:18:32 Log-Likelihood: 47902.
No. Observations: 1580460 AIC: -9.576e+04
Df Residuals: 1580440 BIC: -9.552e+04
Df Model: 19
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept -1.006e-13 0.030 -3.32e-12 1.000 -0.059 0.059
ActualElapsedTime 0.0104 3.71e-05 279.282 0.000 0.010 0.010
AirTime -7.575e-06 3.99e-05 -0.190 0.849 -8.57e-05 7.05e-05
CRSArrTime 7.327e-06 8.77e-07 8.359 0.000 5.61e-06 9.05e-06
CRSDepTime -3.308e-05 1.4e-06 -23.592 0.000 -3.58e-05 -3.03e-05
CRSElapsedTime -0.0097 1.93e-05 -501.884 0.000 -0.010 -0.010
DayOfWeek 0.0004 9.38e-05 3.947 0.000 0.000 0.001
DepDel15 0.6367 0.001 1002.249 0.000 0.635 0.638
DepDelayMinutes 0.0017 7.21e-06 241.732 0.000 0.002 0.002
DepTime 2.74e-05 2.16e-06 12.657 0.000 2.32e-05 3.16e-05
DistanceGroup -0.0131 0.000 -34.174 0.000 -0.014 -0.012
Flights 0.0237 0.030 0.781 0.435 -0.036 0.083
Quarter -0.0009 0.000 -5.176 0.000 -0.001 -0.001
TaxiOut 0.0006 4.13e-05 14.262 0.000 0.001 0.001
WheelsOff 1.99e-05 1.82e-06 10.962 0.000 1.63e-05 2.35e-05
WheelsOn -3.77e-06 8.09e-07 -4.657 0.000 -5.36e-06 -2.18e-06
Dest_enc -9.014e-06 1.76e-06 -5.112 0.000 -1.25e-05 -5.56e-06
DestState_enc -4.569e-05 1.21e-05 -3.781 0.000 -6.94e-05 -2.2e-05
Origin_enc -1.094e-05 1.77e-06 -6.193 0.000 -1.44e-05 -7.48e-06
OriginState_enc 5.025e-05 1.22e-05 4.121 0.000 2.63e-05 7.42e-05
==============================================================================
Omnibus: 251921.459 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3003934.800
Skew: 0.393 Prob(JB): 0.00
Kurtosis: 9.708 Cond. No. 7.60e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.6e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The Variance Inflation Factor (VIF) regresses each input feature on all the other features; features with VIFs above 5 or 10 are typically excluded from the feature set, since a high VIF implies a strong (multicollinear) relationship between the feature in question and the rest. (The negative VIFs below are suspect: statsmodels' variance_inflation_factor assumes the design matrix includes a constant column, usually added with add_constant, and can produce misleading values without one.)
From the results below, AirTime, DistanceGroup, Flights and TaxiOut are potential candidates for dropping
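For intuition, VIF can be computed from scratch with NumPy: regress one column on the rest (with an intercept) and take 1 / (1 − R²). The data below is synthetic, not the flight features:

```python
import numpy as np

def vif(X, j):
    # Regress column j on all other columns plus an intercept,
    # then return 1 / (1 - R^2)
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)  # nearly collinear with x1 -> huge VIF
x3 = rng.normal(size=200)              # independent -> VIF near 1
X = np.column_stack([x1, x2, x3])
```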
X = data.drop('ArrDel15', axis=1).to_pandas()
vif = cudf.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.values.shape[1])]
vif["features"] = X.columns
print(vif.round(1))
    VIF Factor           features
0         -6.5  ActualElapsedTime
1        -17.3            AirTime
2         -0.0         CRSArrTime
3          0.1         CRSDepTime
4          6.1     CRSElapsedTime
5          4.9          DayOfWeek
6          2.1           DepDel15
7          1.9    DepDelayMinutes
8          0.1            DepTime
9         73.5      DistanceGroup
10        40.3            Flights
11         6.0            Quarter
12        17.1            TaxiOut
13         0.0          WheelsOff
14         0.0           WheelsOn
15         0.1           Dest_enc
16         3.3      DestState_enc
17         0.0         Origin_enc
18         3.4    OriginState_enc
X = data.drop(['ArrDel15', 'AirTime', 'DistanceGroup', 'Flights', 'TaxiOut'], axis=1).to_pandas()
vif = cudf.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.values.shape[1])]
vif["features"] = X.columns
print(vif.round(1))
    VIF Factor           features
0         -0.9  ActualElapsedTime
1         -0.0         CRSArrTime
2          0.1         CRSDepTime
3          3.2     CRSElapsedTime
4          4.4          DayOfWeek
5          2.1           DepDel15
6          1.9    DepDelayMinutes
7          0.1            DepTime
8          5.1            Quarter
9          0.0          WheelsOff
10         0.0           WheelsOn
11         0.1           Dest_enc
12         3.0      DestState_enc
13         0.0         Origin_enc
14         3.0    OriginState_enc
Rebuilding the model with the above features excluded
model = sm.ols("""ArrDel15 ~ ActualElapsedTime + CRSArrTime + CRSDepTime + CRSElapsedTime +
    DayOfWeek + DepDel15 + DepDelayMinutes + Quarter + Dest_enc + Origin_enc""", data=data.to_pandas()).fit()
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: ArrDel15 R-squared: 0.655
Model: OLS Adj. R-squared: 0.655
Method: Least Squares F-statistic: 3.007e+05
Date: Sun, 27 Jun 2021 Prob (F-statistic): 0.00
Time: 20:31:55 Log-Likelihood: 45876.
No. Observations: 1580460 AIC: -9.173e+04
Df Residuals: 1580449 BIC: -9.159e+04
Df Model: 10
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 0.0436 0.001 43.974 0.000 0.042 0.046
ActualElapsedTime 0.0106 1.4e-05 753.789 0.000 0.011 0.011
CRSArrTime 7.919e-06 5.85e-07 13.532 0.000 6.77e-06 9.07e-06
CRSDepTime 8.526e-06 6.12e-07 13.923 0.000 7.33e-06 9.73e-06
CRSElapsedTime -0.0103 1.4e-05 -735.086 0.000 -0.010 -0.010
DayOfWeek 0.0003 9.39e-05 3.354 0.001 0.000 0.000
DepDel15 0.6396 0.001 1012.805 0.000 0.638 0.641
DepDelayMinutes 0.0018 7.14e-06 248.429 0.000 0.002 0.002
Quarter -0.0010 0.000 -5.902 0.000 -0.001 -0.001
Dest_enc -9.543e-06 1.74e-06 -5.473 0.000 -1.3e-05 -6.13e-06
Origin_enc -2.428e-05 1.75e-06 -13.904 0.000 -2.77e-05 -2.09e-05
==============================================================================
Omnibus: 256013.711 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3139263.195
Skew: 0.398 Prob(JB): 0.00
Kurtosis: 9.858 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
From the above, the strongest indicator that a flight arrives late is whether it departed more than 15 minutes late (DepDel15). Refitting without the departure-delay features confirms this: R-squared drops from 0.655 to 0.161.
model = sm.ols("""ArrDel15 ~ ActualElapsedTime +
CRSArrTime + CRSDepTime + CRSElapsedTime + DayOfWeek + Quarter + Dest_enc + Origin_enc""",data=data.to_pandas()).fit()
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: ArrDel15 R-squared: 0.161
Model: OLS Adj. R-squared: 0.161
Method: Least Squares F-statistic: 3.802e+04
Date: Mon, 28 Jun 2021 Prob (F-statistic): 0.00
Time: 13:31:30 Log-Likelihood: -6.5709e+05
No. Observations: 1580460 AIC: 1.314e+06
Df Residuals: 1580451 BIC: 1.314e+06
Df Model: 8
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 0.0251 0.002 16.199 0.000 0.022 0.028
ActualElapsedTime 0.0113 2.19e-05 516.994 0.000 0.011 0.011
CRSArrTime 3.663e-05 9.13e-07 40.144 0.000 3.48e-05 3.84e-05
CRSDepTime 8.533e-05 9.52e-07 89.624 0.000 8.35e-05 8.72e-05
CRSElapsedTime -0.0110 2.19e-05 -501.155 0.000 -0.011 -0.011
DayOfWeek 0.0012 0.000 8.475 0.000 0.001 0.002
Quarter -0.0030 0.000 -11.347 0.000 -0.003 -0.002
Dest_enc 9.217e-07 2.72e-06 0.339 0.735 -4.41e-06 6.25e-06
Origin_enc -3.521e-05 2.72e-06 -12.926 0.000 -4.05e-05 -2.99e-05
==============================================================================
Omnibus: 328075.089 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 579920.483
Skew: 1.364 Prob(JB): 0.00
Kurtosis: 4.166 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.